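The overall workflow starts from structured data in a data frame and uses pandoc-family tools together with Markdown, LaTeX, and similar languages to generate many kinds of documents. Those documents are then converted back into structured data with a variety of tools and methods.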

Résumé workflow

1 Résumé PDF data

The example data is the publicly available English résumé PDF from Awesome CV ("Awesome CV is LaTeX template for your outstanding job application").

First page

library(pdftools)
library(magick)

resume_first_png <- pdf_render_page("data/resume.pdf", page = 1, dpi = 300, numeric = FALSE)
image_read(resume_first_png)

Second page

resume_second_png <- pdf_render_page("data/resume.pdf", page = 2, dpi = 300, numeric = FALSE)
image_read(resume_second_png)

2 PDF → data frame

Extract a data frame from the semi-structured résumé PDF file.

2.1 Splitting the data into sections

library(pdftools)
library(tidyverse)

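# Read the text of the PDF (one string per page)
cv_dat <- pdf_text("data/resume.pdf")

# Join all pages into a single string
cv_dat <- paste0(unlist(cv_dat), collapse = "")

# Split into lines; "\r?\n" handles both Windows and Unix line endings
cv_split_dat <- cv_dat %>% 
  str_split(pattern="\r?\n") %>% 
  .[[1]]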

# Personal information (everything before "Summary") -------------------
`인적사항_idx` <- cv_split_dat %>% 
  str_detect("Summary") %>% 
  which()

인적사항 <- cv_split_dat[1:(`인적사항_idx`-1)]

# Summary ("Summary", up to "Work Experience") -------------------
`요약_idx` <- cv_split_dat %>% 
  str_detect("Work Experience") %>% 
  which()

요약 <- cv_split_dat[(`인적사항_idx`+1):(`요약_idx`-1)]

# Work experience ("Work Experience", up to "Honors & Awards") -------------------
`직장경력_idx` <- cv_split_dat %>% 
  str_detect("Honors & Awards") %>% 
  which()

직장경력 <- cv_split_dat[(`요약_idx`+1):(`직장경력_idx`-1)]

# Honors and awards ("Honors & Awards", up to "Presentation") -------------------
`수상이력_idx` <- cv_split_dat %>% 
  str_detect("Presentation") %>% 
  which()

수상이력 <- cv_split_dat[(`직장경력_idx`+1):(`수상이력_idx`-1)]

# Presentations ("Presentation", up to "Writing") -------------------
`발표_idx` <- cv_split_dat %>% 
  str_detect("Writing") %>% 
  which()

발표 <- cv_split_dat[(`수상이력_idx`+1):(`발표_idx`-1)]

# Writing ("Writing", up to "Program Committees") -------------------
`저서_idx` <- cv_split_dat %>% 
  str_detect("Program Committees") %>% 
  which()

저서 <- cv_split_dat[(`발표_idx`+1):(`저서_idx`-1)]

# Program committees ("Program Committees", up to "Education") -------------------
`심사_idx` <- cv_split_dat %>% 
  str_detect("Education") %>% 
  which()

심사 <- cv_split_dat[(`저서_idx`+1):(`심사_idx`-1)]

# Education ("Education", up to "Extracurricular") -------------------
`학교_idx` <- cv_split_dat %>% 
  str_detect("Extracurricular") %>% 
  which()

학교 <- cv_split_dat[(`심사_idx`+1):(`학교_idx`-1)]

# Extracurricular activities ("Extracurricular", to the end of the file) -------------------
특활활동 <- cv_split_dat[(`학교_idx`+1):length(cv_split_dat)]

## Résumé sections

cv_section_list <- list("인적사항" = 인적사항,
     "요약" = 요약,
     "직장경력" = 직장경력,
     "수상이력" = 수상이력,
     "발표" = 발표,
     "저서" = 저서,
     "심사" = 심사,
     "학교"=학교, 
     "특활활동"=특활활동)

listviewer::jsonedit(cv_section_list)
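
The eight near-identical blocks above can be folded into one helper. The following is a minimal sketch, not the approach used in this post: `split_sections()` and its arguments are hypothetical names, it uses base R only, and it assumes that each section header sits on a line of its own (so a trimmed line equals the header text exactly). That assumption has not been checked against the actual `pdf_text()` output.

```r
# Hypothetical helper (not part of the original post): cut a vector of lines
# into named sections, given the header text that opens each section.
split_sections <- function(lines, headers, preamble = "preamble") {
  trimmed <- trimws(lines)
  # Position of the first line that equals each header (NA when absent).
  starts <- vapply(headers, function(h) match(h, trimmed), integer(1))
  # Keep only headers that were found, in document order (names are kept).
  starts <- sort(starts[!is.na(starts)])
  if (length(starts) == 0) {
    return(setNames(list(lines), preamble))
  }
  # Each section ends right before the next header, the last one at the end.
  ends <- c(starts[-1] - 1L, length(lines))
  sections <- Map(
    function(s, e) if (e > s) lines[(s + 1L):e] else character(0),
    starts, ends
  )
  # Lines before the first header go into the preamble element.
  c(setNames(list(lines[seq_len(starts[1] - 1L)]), preamble), sections)
}

# Toy input with mixed line endings; the names and text are made up.
toy_text <- paste0(
  "Jane Doe\r\nData Engineer\nSummary\r\nBuilds pipelines.\n",
  "Work Experience\nACME Corp.\nENGINEER\nEducation\nB.S. in CS"
)
toy_lines <- strsplit(toy_text, "\r?\n")[[1]]
toy <- split_sections(toy_lines, c("Summary", "Work Experience", "Education"))
names(toy)
```

Matching whole trimmed lines rather than substrings avoids a false hit when a header word such as "Education" also appears inside body text, at the cost of missing headers that carry extra decoration.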

2.2 Structuring the data

2.2.1 Personal information

# Trim whitespace and drop the icon glyphs (private-use code points)
인적사항 <- str_trim(인적사항) %>% str_remove_all(pattern="\uf10b|\uf0e0|\uf015|\uf092|\uf08c")

이름 <- 인적사항[1]
직무 <- 인적사항[2]
주소 <- 인적사항[3]

개인정보 <- str_split(인적사항[4], " \\| ") %>% .[[1]]
전화번호 <- str_trim(개인정보[1])
전자우편 <- str_trim(개인정보[2])
홈페이지 <- str_trim(개인정보[3])
GitHub   <- str_trim(개인정보[4])
링크트인 <- str_trim(개인정보[5])

인적사항_df <- tibble(
  "이름" = 이름,
  "직무" = 직무,
  "주소" = 주소,
  "전화번호" = 전화번호,
  "전자우편" = 전자우편,
  "홈페이지" = 홈페이지,
  "Github"   = GitHub,
  "링크트인" = 링크트인
) 
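
The positional assignments above break silently if the contact line has fewer fields than expected. As a hedged alternative, the sketch below splits the line on `|` and pads missing fields with `NA`. `parse_contact()` and the English field names are hypothetical, the sample line is made up, and the field order is assumed to match the résumé template.

```r
# Hypothetical helper (not part of the original post): split a
# "a | b | c" contact line into a one-row data frame.
parse_contact <- function(line,
                          fields = c("phone", "email", "homepage",
                                     "github", "linkedin")) {
  parts <- trimws(strsplit(line, "|", fixed = TRUE)[[1]])
  # Pad with NA (or truncate) so the result always has one value per field.
  length(parts) <- length(fields)
  as.data.frame(as.list(setNames(parts, fields)), stringsAsFactors = FALSE)
}

# Made-up contact line with the last field missing.
contact <- parse_contact(
  "(+82) 10-0000-0000 | jane@example.com | www.example.com | janedoe"
)
contact
```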

2.2.2 Summary

요약_df <- tibble(
  "요약" = str_c(요약, collapse=" ")
)
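
Lines returned by `pdf_text()` usually carry layout padding, so joining them as-is can leave runs of spaces in the summary. A small base R sketch of one way to normalise that is shown here; `collapse_lines()` is a hypothetical name and the input is made up.

```r
# Hypothetical helper (not part of the original post): join wrapped lines
# into one paragraph and squeeze repeated whitespace.
collapse_lines <- function(lines) {
  text <- paste(trimws(lines), collapse = " ")
  trimws(gsub("\\s+", " ", text))
}

collapse_lines(c("  Experienced engineer   ", "", "with a focus on   data. "))
```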

2.2.3 Work experience

직장경력
 [1] "Omnious. Co., Ltd.                                                                                                                   Seoul, S.Korea"     
 [2] "SOFTWARE ARCHITECT                                                                                                               Jun. 2017 - May. 2018"  
 [3] "• Provisioned an easily managable hybrid infrastructure(Amazon AWS + On-premise) utilizing IaC(Infrastructure as Code) tools like Ansible, Packer"
 [4] "   and Terraform."
 [5] "• Built fully automated CI/CD pipelines on CircleCI for containerized applications using Docker, AWS ECR and Rancher."
 [6] "• Designed an overall service architecture and pipelines of the Machine Learning based Fashion Tagging API SaaS product with the micro-services"
 [7] "   architecture."
 [8] "• Implemented several API microservices in Node.js Koa and in the serverless AWS Lambda functions."
 [9] "• Deployed a centralized logging environment(ELK, Filebeat, CloudWatch, S3) which gather log data from docker containers and AWS resources."
[10] "• Deployed a centralized monitoring environment(Grafana, InfluxDB, CollectD) which gather system metrics as well as docker run-time metrics."
[11] "PLAT Corp.                                                                                                                           Seoul, S.Korea"     
[12] "CO-FOUNDER & SOFTWARE ENGINEER                                                                                                    Jan. 2016 - Jun. 2017" 
[13] "• Implemented RESTful API server for car rental booking application(CARPLAT in Google Play)."
[14] "• Built and deployed overall service infrastructure utilizing Docker container, CircleCI, and several AWS stack(Including EC2, ECS, Route 53, S3,"
[15] "   CloudFront, RDS, ElastiCache, IAM), focusing on high-availability, fault tolerance, and auto-scaling."
[16] "• Developed an easy-to-use Payment module which connects to major PG(Payment Gateway) companies in Korea."
[17] "R.O.K Cyber Command, MND                                                                                                             Seoul, S.Korea"     
[18] "SOFTWARE ENGINEER & SECURITY RESEARCHER (COMPULSORY MILITARY SERVICE)                                                             Aug. 2014 - Apr. 2016" 
[19] "• Lead engineer on agent-less backtracking system that can discover client device’s fingerprint(including public and private IP) independently of"
[20] "   the Proxy, VPN and NAT."
[21] "• Implemented a distributed web stress test tool with high anonymity."
[22] "• Implemented a military cooperation system which is web based real time messenger in Scala on Lift."
[23] "NEXON                                                                                                                     Seoul, S.Korea & LA, U.S.A"    
[24] "GAME DEVELOPER INTERN AT GLOBAL INTERNSHIP PROGRAM                                                                                Jan. 2013 - Feb. 2013" 
[25] "• Developed in Cocos2d-x an action puzzle game(Dragon Buster) targeting U.S. market."
[26] "• Implemented API server which is communicating with game client and In-App Store, along with two other team members who wrote the game"
[27] "   logic and designed game graphics."
[28] "• Won the 2nd prize in final evaluation."
[29] "ShitOne Corp.                                                                                                                        Seoul, S.Korea"     
[30] "SOFTWARE ENGINEER                                                                                                                 Dec. 2011 - Feb. 2012" 
[31] "• Developed a proxy drive smartphone application which connects proxy driver and customer."
[32] "• Implemented overall Android application logic and wrote API server for community service, along with lead engineer who designed bidding"
[33] "   protocol on raw socket and implemented API server for bidding."
[34] "SAMSUNG Electronics                                                                                                                            S.Korea"  
[35] "FREELANCE PENETRATION TESTER                                                                                            Sep. 2013, Mar. 2011 - Oct. 2011"
[36] "• Conducted penetration testing on SAMSUNG KNOX, which is solution for enterprise mobile security."
[37] "• Conducted penetration testing on SAMSUNG Smart TV."
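
The printed lines follow a regular pattern: a company line and a role line, each with a left and a right field separated by a wide gap, followed by bullet lines whose wrapped continuations start with whitespace. The sketch below turns that pattern into a long data frame with one row per bullet. It is only an illustration under those assumptions: `parse_experience()`, the English column names, and the toy input are hypothetical, and the bullet character is assumed to be U+2022.

```r
# Hypothetical helper (not part of the original post): parse
# company / role / bullet lines into one row per bullet.
parse_experience <- function(lines, bullet = "\u2022") {
  jobs <- list()
  expect_role <- FALSE  # TRUE right after a company line has been read
  for (line in lines) {
    if (!nzchar(trimws(line))) next              # skip blank lines
    is_bullet <- startsWith(line, bullet)
    is_continuation <- grepl("^\\s", line)
    if ((is_bullet || is_continuation) && length(jobs) == 0) next
    n <- length(jobs)
    if (is_bullet) {
      # A new bullet starts a new detail entry for the current job.
      text <- trimws(substring(line, nchar(bullet) + 1L))
      jobs[[n]]$bullets <- c(jobs[[n]]$bullets, text)
    } else if (is_continuation) {
      # A wrapped line is appended to the most recent bullet.
      b <- length(jobs[[n]]$bullets)
      jobs[[n]]$bullets[b] <- paste(jobs[[n]]$bullets[b], trimws(line))
    } else {
      # Header line: left and right fields are separated by 2+ spaces.
      parts <- strsplit(trimws(line), "\\s{2,}")[[1]]
      if (expect_role) {
        jobs[[n]]$role <- parts[1]
        jobs[[n]]$period <- parts[2]
      } else {
        jobs[[n + 1L]] <- list(company = parts[1], location = parts[2],
                               role = NA_character_, period = NA_character_,
                               bullets = character(0))
      }
      expect_role <- !expect_role
    }
  }
  rows <- lapply(jobs, function(j) {
    data.frame(company = j$company, location = j$location,
               role = j$role, period = j$period,
               detail = if (length(j$bullets)) j$bullets else NA_character_,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Made-up input that mimics the layout of the printed lines.
toy_experience <- c(
  "ACME Corp.                         Seoul, S.Korea",
  "SOFTWARE ENGINEER                  Jan. 2020 - Dec. 2021",
  "\u2022 Built a data pipeline that spans",
  "   two lines of text.",
  "\u2022 Wrote unit tests.",
  "Example Inc.                       Busan, S.Korea",
  "INTERN                             Jun. 2019 - Aug. 2019",
  "\u2022 Fixed bugs."
)
experience_df <- parse_experience(toy_experience)
experience_df
```

Keeping one row per bullet makes it easy to filter or count details per employer later; collapsing the bullets into a single text column per job would be an equally reasonable choice.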